Abstract

Recent studies reveal that a deep neural network 
can learn transferable features which generalize 
well to novel tasks for domain adaptation. However,
as deep features eventually transition from
general to specific along the network, the feature 
transferability drops significantly in higher layers 
with increasing domain discrepancy. Hence, it is 
important to formally reduce the dataset bias and 
enhance the transferability in task-specific layers. 

In this paper, we propose a new Deep Adaptation 
Network (DAN) architecture, which generalizes 
deep convolutional neural network to the domain 
adaptation scenario. In DAN, hidden representations
of all task-specific layers are embedded in a
reproducing kernel Hilbert space where the mean 
embeddings of different domain distributions can 
be explicitly matched. The domain discrepancy 
is further reduced using an optimal multi-kernel 
selection method for mean embedding matching. 

DAN can learn transferable features with statistical
guarantees, and can scale linearly by using an unbiased
estimate of kernel embedding. Extensive empirical
evidence shows that the proposed architecture
yields state-of-the-art image classification error
rates on standard domain adaptation benchmarks.

1. Introduction 

The generalization error of supervised learning machines 
with limited training samples will be unsatisfactorily large, 
while manual labeling of sufficient training data for diverse 
application domains may be prohibitive. Therefore, there is
an incentive to establish effective algorithms to reduce the
labeling cost, typically by leveraging off-the-shelf labeled
data from relevant source domains to the target domains.
Domain adaptation addresses the problem that we have data 
from two related domains but under different distributions. 
The domain discrepancy poses a major obstacle in adapting 
predictive models across domains. For example, an object
recognition model trained on manually annotated images 
may not generalize well on testing images under substantial 
variations in the pose, occlusion, or illumination. Domain 
adaptation establishes knowledge transfer from the labeled 
source domain to the unlabeled target domain by exploring 
domain-invariant structures that bridge different domains 
of substantial distribution discrepancy (Pan & Yang, 2010). 

One of the main approaches to establishing knowledge 
transfer is to learn domain-invariant models from data, 
which can bridge the source and target domains in an isomorphic
latent feature space. In this direction, a fruitful line
of prior work has focused on learning shallow features by
jointly minimizing a distance metric of domain discrepancy
(Pan et al., 2011; Long et al., 2013; Baktashmotlagh et al.,
2013; Gong et al., 2013; Zhang et al., 2013; Ghifary et al.,
2014; Wang & Schneider, 2014). However, recent studies
have shown that deep neural networks can learn more transferable
features for domain adaptation (Glorot et al., 2011;
Donahue et al., 2014; Yosinski et al., 2014), which produce
breakthrough results on some domain adaptation datasets.
Deep neural networks are able to disentangle explanatory
factors of variations underlying the data samples, and group 
features hierarchically in accordance with their relatedness 
to invariant factors, making representations robust to noise. 

While deep neural networks are more powerful for learning
general and transferable features, the latest findings also reveal
that the deep features must eventually transition from
general to specific along the network, and feature transferability
drops significantly in higher layers with increasing
domain discrepancy. In other words, the features computed
in higher layers of the network must depend greatly on
the specific dataset and task (Yosinski et al., 2014), which
are task-specific features and are not safely transferable to
novel tasks. Another curious phenomenon is that disentangling
the variational factors in higher layers of the network
may enlarge the domain discrepancy, as different domains
with the new deep representations become more “compact”
and are more mutually distinguishable (Glorot et al., 2011).
Although deep features are salient for discrimination, enlarged
dataset bias may deteriorate domain adaptation performance,
resulting in statistically unbounded risk for the
target tasks (Mansour et al., 2009; Ben-David et al., 2010).

Inspired by the literature’s latest understanding about the 
transferability of deep neural networks, we propose in this 
paper a new Deep Adaptation Network (DAN) architecture, 
which generalizes deep convolutional neural network to the 
domain adaptation scenario. The main idea of this work 
is to enhance the feature transferability in the task-specific 
layers of the deep neural network by explicitly reducing 
the domain discrepancy. To achieve this goal, the hidden
representations of all the task-specific layers are embedded
in a reproducing kernel Hilbert space where the mean embeddings
of different domain distributions can be explicitly
matched. As mean embedding matching is sensitive to the
kernel choices, an optimal multi-kernel selection procedure
is devised to further reduce the domain discrepancy. In addition,
we implement a linear-time unbiased estimate of the
kernel mean embedding to enable scalable training, which
is very desirable for deep learning. Finally, as deep models
pre-trained with large-scale repositories such as ImageNet
(Russakovsky et al., 2014) are representative for general-purpose
tasks (Yosinski et al., 2014; Hoffman et al., 2014),
the proposed DAN model is trained by fine-tuning from 
the AlexNet model (Krizhevsky et al., 2012) pre-trained on 
ImageNet, which is implemented in Caffe (Jia et al., 2014). 
Comprehensive empirical evidence demonstrates that the 
proposed architecture outperforms state-of-the-art results 
evaluated on the standard domain adaptation benchmarks. 

The contributions of this paper are summarized as follows.
(1) We propose a novel deep neural network architecture
for domain adaptation, in which all the layers corresponding
to task-specific features are adapted in a layerwise
manner, hence benefiting from “deep adaptation.”
(2) We explore multiple kernels for adapting deep representations,
which substantially enhances adaptation effectiveness
compared to single kernel methods. Our model can
yield unbiased deep features with statistical guarantees. 

2. Related Work 

A related literature is transfer learning (Pan & Yang, 2010), 
which builds models that bridge different domains or tasks, 
explicitly taking domain discrepancy into consideration. 
Transfer learning aims to mitigate the effort of manual labeling
for machine learning (Pan et al., 2011; Gong et al.,
2013; Zhang et al., 2013; Wang & Schneider, 2014) and
computer vision (Saenko et al., 2010; Gong et al., 2012;
Baktashmotlagh et al., 2013; Long et al., 2013), etc. It is
widely recognized that the domain discrepancy in the probability
distributions of different domains should be formally
measured and reduced. The major bottleneck is how
to match different domain distributions effectively. Most
existing methods learn a new shallow representation model
in which the domain discrepancy can be explicitly reduced.
However, without learning deep features which can suppress
domain-specific factors, the transferability of shallow
features could be limited by the task-specific variability.

Deep neural networks learn nonlinear representations that
disentangle and hide different explanatory factors of variation
behind data samples (Bengio et al., 2013). The learned
deep representations manifest invariant factors underlying
different populations and are transferable from the original
tasks to similar novel tasks (Yosinski et al., 2014). Hence,
deep neural networks have been explored for domain adaptation
(Glorot et al., 2011; Chen et al., 2012), multimodal
and multi-source learning problems (Ngiam et al., 2011;
Ge et al., 2013), where significant performance gains have
been obtained. However, all these methods depend on the 
assumption that deep neural networks can learn invariant 
representations that are transferable across different tasks. 
In reality, the domain discrepancy can be alleviated, but 
not removed, by deep neural networks (Glorot et al., 2011). 
Dataset shift has posed a bottleneck to the transferability of 
deep networks, resulting in statistically unbounded risk for 
target tasks (Mansour et al., 2009; Ben-David et al., 2010). 

Our work is primarily motivated by Yosinski et al. (2014), 
which comprehensively explores feature transferability of 
deep convolutional neural networks. The method focuses 
on a different scenario where the learning tasks are different
across domains, hence it requires sufficient labeled
target examples such that the source network can be fine-tuned
to the target task. In many real problems, labeled
data is usually limited, especially for a novel target task,
hence the method is not directly applicable to domain
adaptation. There are several very recent efforts in learning
domain-invariant features in the context of shallow neural
networks (Ajakan et al., 2014; Ghifary et al., 2014). Due
to the limited capacity of shallow architectures, the performance
of these proposals does not surpass deep CNN
(Krizhevsky et al., 2012). Tzeng et al. (2014) proposed a
DDC model that adds an adaptation layer and a dataset shift
loss to the deep CNN for learning a domain-invariant representation.
While performance was improved, DDC only
adapts a single layer of the network, which may be restrictive
in that there are multiple layers where the hidden features
are not transferable (Yosinski et al., 2014). DDC is
also limited by suboptimal kernel matching of probability
distributions (Gretton et al., 2012b) and its quadratic computational
cost that restricts transferability and scalability.

3. Deep Adaptation Networks

In unsupervised domain adaptation, we are given a source
domain $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ with $n_s$ labeled examples, and
a target domain $\mathcal{D}_t = \{x_j^t\}_{j=1}^{n_t}$ with $n_t$ unlabeled examples.
The source domain and target domain are characterized
by probability distributions $p$ and $q$, respectively.
We aim to construct a deep neural network which is able
to learn transferable features that bridge the cross-domain
discrepancy, and build a classifier $y = \theta(x)$ which can
minimize target risk $\epsilon_t(\theta) = \Pr_{(x,y) \sim q}[\theta(x) \neq y]$ using
source supervision. In semi-supervised adaptation where
the target has a small number of labeled examples, we denote
by $\mathcal{D}_a = \{(x_i^a, y_i^a)\}$ the $n_a$ annotated examples of
source and target domains.

3.1. Model 

MK-MMD. Domain adaptation is challenging in that the
target domain has no (or only limited) labeled information.
To approach this problem, many existing methods aim to
bound the target error by the source error plus a discrepancy
metric between the source and the target (Ben-David et al.,
2010). Two classes of statistics have been explored for
two-sample testing, where acceptance or rejection decisions
are made for a null hypothesis $p = q$, given samples
generated respectively from $p$ and $q$: energy distances and
maximum mean discrepancies (MMD) (Sejdinovic et al.,
2013). In this paper, we focus on the multiple kernel variant
of MMD (MK-MMD) proposed by Gretton et al. (2012b),
which is formalized to jointly maximize the two-sample
test power and minimize the Type II error, i.e., the failure
to reject a false null hypothesis.

Denote by $\mathcal{H}_k$ the reproducing kernel Hilbert space
(RKHS) endowed with a characteristic kernel $k$. The mean
embedding of distribution $p$ in $\mathcal{H}_k$ is a unique element
$\mu_k(p)$ such that $\mathbf{E}_{x \sim p} f(x) = \langle f(x), \mu_k(p) \rangle_{\mathcal{H}_k}$ for all
$f \in \mathcal{H}_k$. The MK-MMD $d_k(p, q)$ between probability distributions
$p$ and $q$ is defined as the RKHS distance between
the mean embeddings of $p$ and $q$. The squared formulation
of MK-MMD is defined as

$$d_k^2(p, q) \triangleq \left\| \mathbf{E}_p[\phi(x^s)] - \mathbf{E}_q[\phi(x^t)] \right\|_{\mathcal{H}_k}^2. \quad (1)$$

The most important property is that $p = q$ iff $d_k^2(p, q) = 0$
(Gretton et al., 2012a). The characteristic kernel associated
with the feature map $\phi$, $k(x^s, x^t) = \langle \phi(x^s), \phi(x^t) \rangle$, is
defined as the convex combination of $m$ PSD kernels $\{k_u\}$,

$$\mathcal{K} \triangleq \left\{ k = \sum_{u=1}^{m} \beta_u k_u : \sum_{u=1}^{m} \beta_u = 1, \ \beta_u \geq 0, \ \forall u \right\}, \quad (2)$$

where the constraints on coefficients $\{\beta_u\}$ are imposed to
guarantee that the derived multi-kernel $k$ is characteristic.
As studied theoretically in Gretton et al. (2012b), the kernel
adopted for the mean embeddings of $p$ and $q$ is critical to
ensure the test power and low test error. The multi-kernel
$k$ can leverage different kernels to enhance MK-MMD test,
leading to a principled method for optimal kernel selection.

Figure 1. The DAN architecture for learning transferable features.
Since deep features eventually transition from general to specific
along the network, (1) the features extracted by convolutional layers
conv1-conv3 are general, hence these layers are frozen, (2)
the features extracted by layers conv4-conv5 are slightly less
transferable, hence these layers are learned via fine-tuning, and
(3) fully connected layers fc6-fc8 are tailored to fit specific
tasks, hence they are not transferable and should be adapted with
MK-MMD.
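As an illustration of the definitions above, the following minimal sketch evaluates a multi-kernel built from Gaussian kernels and a plain quadratic-time estimate of the squared MK-MMD on toy data. It is not the authors' implementation: the function names, the bandwidths, the uniform weights, and the toy samples are all assumptions made for the example, and the estimator shown is the simple biased (V-statistic) form.

```python
import math


def gaussian_kernel(x, y, gamma):
    """Gaussian kernel exp(-||x - y||^2 / gamma) on two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / gamma)


def multi_kernel(x, y, gammas, betas):
    """Convex combination sum_u beta_u * k_u(x, y) of Gaussian kernels."""
    return sum(b * gaussian_kernel(x, y, g) for g, b in zip(gammas, betas))


def mmd2_quadratic(source, target, gammas, betas):
    """Biased estimate of squared MK-MMD; uses O(n^2) kernel evaluations."""
    def mean_k(xs, ys):
        total = sum(multi_kernel(x, y, gammas, betas) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_k(source, source) + mean_k(target, target)
            - 2.0 * mean_k(source, target))


# Hypothetical toy data: two 2-D samples, the second shifted by (3, 3).
source = [[0.0, 0.0], [0.5, 0.1], [0.2, 0.4], [0.1, 0.3]]
target = [[3.0, 3.0], [3.5, 3.1], [3.2, 3.4], [3.1, 3.3]]
gammas = [0.5, 1.0, 2.0]        # assumed bandwidths
betas = [1 / 3, 1 / 3, 1 / 3]   # nonnegative weights that sum to 1

same = mmd2_quadratic(source, source, gammas, betas)
shifted = mmd2_quadratic(source, target, gammas, betas)
print(same, shifted)
```

Comparing a sample with itself gives zero, while the shifted sample gives a clearly positive value, which is the behavior the property $p = q$ iff $d_k^2(p, q) = 0$ suggests.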

One of the feasible strategies for controlling the domain
discrepancy is to find an abstract feature representation
through which the source and target domains are similar
(Ben-David et al., 2010). Although this idea has been
explored in several papers (Pan et al., 2011; Zhang et al.,
2013; Wang & Schneider, 2014), to date there has been no
attempt to enhance the transferability of feature representation
via MK-MMD in deep neural networks.

Deep Adaptation Networks (DAN). In this paper, we explore
the idea of MK-MMD-based adaptation for learning
transferable features in deep networks. We start with deep
convolutional neural networks (CNN) (Krizhevsky et al.,
2012), a strong model when it is adapted to novel tasks
(Donahue et al., 2014; Hoffman et al., 2014). The main
challenge is that the target domain has no or just limited
labeled information, hence directly adapting CNN to the
target domain via fine-tuning is impossible or is prone to
over-fitting. With the idea of domain adaptation, we are
targeting a deep adaptation network (DAN) that can exploit
both source-labeled data and target-unlabeled data. Figure 1
gives an illustration of the proposed DAN model.

We extend the AlexNet architecture (Krizhevsky et al.,
2012), which is comprised of five convolutional layers
(conv1-conv5) and three fully connected layers (fc6-fc8).
Each fc layer $\ell$ learns a nonlinear mapping
$h_i^\ell = f^\ell(W^\ell h_i^{\ell-1} + b^\ell)$, where $h_i^\ell$ is the $\ell$th layer hidden
representation of point $x_i$, $W^\ell$ and $b^\ell$ are the weights and bias
of the $\ell$th layer, and $f^\ell$ is the activation, taken as rectifier
units $f^\ell(x) = \max(0, x)$ for hidden layers or softmax
units $f^\ell(x) = e^{x} / \sum_{j=1}^{|x|} e^{x_j}$ for the output layer. Letting
$\Theta = \{W^\ell, b^\ell\}_{\ell=1}^{l}$ denote the set of all CNN parameters,
the empirical risk of CNN is

$$\min_{\Theta} \frac{1}{n_a} \sum_{i=1}^{n_a} J(\theta(x_i^a), y_i^a), \quad (3)$$

where $J$ is the cross-entropy loss function, and $\theta(x_i^a)$ is
the conditional probability that the CNN assigns $x_i^a$ to label
$y_i^a$. We will not discuss how to compute the convolutional
layers as we will not impose distribution-adaptation
regularization in those layers, given that the convolutional
layers can learn generic features that tend to be transferable
in layers conv1-conv3 and are slightly domain-biased in
conv4-conv5 (Yosinski et al., 2014). Hence, when adapting
the pre-trained AlexNet to the target, we opt to freeze
conv1-conv3 and fine-tune conv4-conv5 to preserve the
efficacy of fragile co-adaptation (Hinton et al., 2012).
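The fully connected mapping and the empirical risk described above can be sketched in a few lines. This is only an illustrative toy, not the Caffe implementation used in the paper: the helper names, the list-of-rows weight layout, and the tiny network sizes are assumptions for the example.

```python
import math


def affine(W, b, h_prev):
    """Compute W h_prev + b, with W given as a list of rows."""
    return [sum(w * h for w, h in zip(row, h_prev)) + bias
            for row, bias in zip(W, b)]


def relu_layer(W, b, h_prev):
    """Hidden fc layer with rectifier units: max(0, W h_prev + b)."""
    return [max(0.0, v) for v in affine(W, b, h_prev)]


def softmax_layer(W, b, h_prev):
    """Output layer with softmax units (shifted by the max for stability)."""
    logits = affine(W, b, h_prev)
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def empirical_risk(probabilities, labels):
    """Mean cross-entropy -log p(y_i | x_i) over the labeled examples."""
    losses = [-math.log(p[y]) for p, y in zip(probabilities, labels)]
    return sum(losses) / len(labels)


# Hypothetical toy network: 2 inputs -> 2 hidden rectifier units -> 4 classes.
W1, b1 = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]
W2, b2 = [[0.0, 0.0] for _ in range(4)], [0.0] * 4  # zero weights: uniform output

inputs = [[2.0, 1.0], [0.5, 1.5]]
labels = [0, 3]
hidden = [relu_layer(W1, b1, x) for x in inputs]
probs = [softmax_layer(W2, b2, h) for h in hidden]
risk = empirical_risk(probs, labels)
print(hidden[0], risk)
```

With all-zero output weights the softmax is uniform over the four classes, so the risk equals log 4, which gives a quick check that the loss is wired correctly.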

In standard CNNs, deep features must eventually transition
from general to specific by the last layer of the network, and
the transferability gap grows with the domain discrepancy
and becomes particularly large when transferring the higher
layers fc6-fc8 (Yosinski et al., 2014). In other words, the
fc layers are tailored to their original task at the expense of
degraded performance on the target task, hence they cannot
be directly transferred to the target domain via fine-tuning
with limited target supervision. In this paper, we fine-tune
CNN on the source labeled examples and require the distributions
of the source and target to become similar under
the hidden representations of fully connected layers fc6-fc8.
This can be realized by adding an MK-MMD-based
multi-layer adaptation regularizer (1) to the CNN risk (3):

$$\min_{\Theta} \frac{1}{n_a} \sum_{i=1}^{n_a} J(\theta(x_i^a), y_i^a) + \lambda \sum_{\ell=l_1}^{l_2} d_k^2(\mathcal{D}_s^\ell, \mathcal{D}_t^\ell), \quad (4)$$

where $\lambda > 0$ is a penalty parameter, $l_1$ and $l_2$ are layer indices
between which the regularizer is effective. In our implementation
of DAN, we set $l_1 = 6$ and $l_2 = 8$, although
different configurations are also possible, depending on the
size of the labeled source dataset and the number of parameters
in the layers that are to be fine-tuned. $\mathcal{D}_*^\ell = \{h_i^{*\ell}\}$ is
the $\ell$th layer hidden representation for the source and target
examples, and $d_k^2(\mathcal{D}_s^\ell, \mathcal{D}_t^\ell)$ is the MK-MMD between the
source and target evaluated on the $\ell$th layer representation.

Training a deep CNN requires a large amount of labeled
data, which is prohibitive for many domain adaptation
problems, hence we start with an AlexNet model pre-trained
on ImageNet 2012 and fine-tune it as in Yosinski
et al. (2014). With the proposed DAN optimization framework
(4), we are able to learn transferable features from a
source domain to a related target domain. The learned representation
can be both salient, benefiting from CNN, and
unbiased, thanks to MK-MMD. Two important advantages
that distinguish DAN from relevant literature are: (1) multi-layer
adaptation. As revealed by Yosinski et al. (2014),
feature transferability gets worse on conv4-conv5 and significantly
drops on fc6-fc8, hence it is critical to adapt
multiple layers instead of only one layer. In other words,
adapting a single layer cannot undo the dataset bias between
the source and the target, since there are other layers
that are not transferable. Another benefit of multi-layer
adaptation is that by jointly adapting the representation layers
and the classifier layer, we could essentially bridge the
domain discrepancy underlying both the marginal distribution
and the conditional distribution, which is crucial for
domain adaptation (Zhang et al., 2013). (2) multi-kernel
adaptation. As pointed out by Gretton et al. (2012b), kernel
choice is critical to the testing power of MMD since different
kernels may embed probability distributions in different
RKHSs where different orders of sufficient statistics can be
emphasized. This is crucial for moment matching, which is
not well explored by previous domain adaptation methods.

3.2. Algorithm 

Learning $\Theta$. Using the kernel trick, MK-MMD (1) can be
computed as the expectation of kernel functions $d_k^2(p, q) =
\mathbf{E}_{x^s x'^s} k(x^s, x'^s) + \mathbf{E}_{x^t x'^t} k(x^t, x'^t) - 2 \mathbf{E}_{x^s x^t} k(x^s, x^t)$,
where $x^s, x'^s \sim p$, $x^t, x'^t \sim q$, and $k \in \mathcal{K}$. However, this
computation incurs a complexity of $O(n^2)$, which is rather
undesirable for deep CNNs, as the power of deep neural
networks largely derives from learning with large-scale
datasets. Moreover, the summation over pairwise similarities
between data points makes mini-batch stochastic
gradient descent (SGD) more difficult, whereas mini-batch
SGD is crucial to the training effectiveness of deep networks.
While prior work based on MMD (Pan et al., 2011;
Tzeng et al., 2014) rarely addresses this issue, we believe it
is critical in the context of deep learning. In this paper, we
adopt the unbiased estimate of MK-MMD (Gretton et al.,
2012b) which can be computed with linear complexity.
More specifically, $d_k^2(p, q) = \frac{2}{n_s} \sum_{i=1}^{n_s/2} g_k(z_i)$, where
we denote quad-tuple $z_i \triangleq (x_{2i-1}^s, x_{2i}^s, x_{2i-1}^t, x_{2i}^t)$, and
evaluate multi-kernel function $k$ on each quad-tuple by
$g_k(z_i) \triangleq k(x_{2i-1}^s, x_{2i}^s) + k(x_{2i-1}^t, x_{2i}^t) - k(x_{2i-1}^s, x_{2i}^t) -
k(x_{2i}^s, x_{2i-1}^t)$. This approach computes an expectation of
independent variables as in (1) with cost $O(n)$.
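The linear-time estimate described above can be sketched as follows: consecutive pairs of source and target points form quad-tuples, and the estimate is the average of the quad-tuple statistic over them. This is a hedged illustration rather than the paper's code; it assumes Gaussian base kernels of the form exp(-||x - y||^2 / gamma), equal-length source and target lists, and hypothetical names and toy data.

```python
import math


def multi_kernel(x, y, gammas, betas):
    """Weighted sum of Gaussian kernels exp(-||x - y||^2 / gamma_u)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return sum(beta * math.exp(-sq_dist / gamma)
               for gamma, beta in zip(gammas, betas))


def quad_statistic(xs1, xs2, xt1, xt2, gammas, betas):
    """g_k on one quad-tuple: two within-domain terms minus two cross terms."""
    def k(a, b):
        return multi_kernel(a, b, gammas, betas)

    return k(xs1, xs2) + k(xt1, xt2) - k(xs1, xt2) - k(xs2, xt1)


def mmd2_linear(source, target, gammas, betas):
    """Average g_k over disjoint quad-tuples; O(n) kernel evaluations."""
    n_quads = min(len(source), len(target)) // 2
    if n_quads == 0:
        raise ValueError("need at least two points per domain")
    total = 0.0
    for i in range(n_quads):
        total += quad_statistic(source[2 * i], source[2 * i + 1],
                                target[2 * i], target[2 * i + 1],
                                gammas, betas)
    return total / n_quads


# Hypothetical toy data, bandwidths, and weights.
source = [[0.0, 0.0], [0.5, 0.1], [0.2, 0.4], [0.1, 0.3]]
target = [[3.0, 3.0], [3.5, 3.1], [3.2, 3.4], [3.1, 3.3]]
gammas = [0.5, 1.0, 2.0]
betas = [1 / 3, 1 / 3, 1 / 3]

same = mmd2_linear(source, source, gammas, betas)
shifted = mmd2_linear(source, target, gammas, betas)
print(same, shifted)
```

Because each quad-tuple touches only four points, the statistic decomposes over mini-batches, which is what makes it convenient for SGD.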

When we train deep CNN by mini-batch SGD, we only
need to consider the gradient of objective (4) with respect to
each data point $x_i$. Since the linear-time MK-MMD takes
a nice summation form that can be readily decoupled into
the sum of $g_k(z_i)$'s, we only need to compute the gradients
$\partial g_k(z_i^\ell) / \partial \Theta^\ell$ for the quad-tuple
$z_i^\ell = (h_{2i-1}^{s\ell}, h_{2i}^{s\ell}, h_{2i-1}^{t\ell}, h_{2i}^{t\ell})$ of
the $\ell$th layer hidden representation. To be consistent with
the gradient of MK-MMD, we need to compute the corresponding
gradient of CNN risk $\partial J(z_i) / \partial \Theta^\ell$, where $J(z_i) =
\sum_{i'} J(\theta(x_{i'}^a), y_{i'}^a)$, and $\{(x_{i'}^a, y_{i'}^a)\}$ indicates the labeled
examples in quad-tuple $z_i$; for instance, in unsupervised
adaptation where the target domain has no labeled data, we
have $\{(x_{i'}^a, y_{i'}^a)\} = \{(x_{2i-1}^s, y_{2i-1}^s), (x_{2i}^s, y_{2i}^s)\}$. To perform
a mini-batch update, we compute the gradient of objective
(4) with respect to the $\ell$th layer parameter $\Theta^\ell$ as

$$\nabla_{\Theta^\ell} = \frac{\partial J(z_i)}{\partial \Theta^\ell} + \lambda \frac{\partial g_k(z_i^\ell)}{\partial \Theta^\ell}. \quad (5)$$

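Such a mini-batch SGD can be easily implemented within
the Caffe framework for CNNs (Jia et al., 2014). Given
kernel $k$ as the linear combination of $m$ Gaussian kernels
$\{k_u(x_i, x_j) = e^{-\|x_i - x_j\|^2 / \gamma_u}\}$, the gradient
$\partial g_k(z_i^\ell) / \partial \Theta^\ell$ can
be readily computed using the chain rule. For instance,

$$\frac{\partial k(h_{2i-1}^{s\ell}, h_{2i}^{t\ell})}{\partial W^\ell} = -\sum_{u=1}^{m} \frac{2\beta_u}{\gamma_u} k_u(h_{2i-1}^{s\ell}, h_{2i}^{t\ell}) \times (h_{2i-1}^{s\ell} - h_{2i}^{t\ell}) \times \left( I[h_{2i-1}^{s(\ell-1)}] - I[h_{2i}^{t(\ell-1)}] \right)^{\top}, \quad (6)$$

where the last row computes the gradient of the $\ell$th layer
rectifier units, with $I$ being defined as an indicator such that
$I[h_{ji}^{\ell-1}] = h_{ji}^{\ell-1}$ if $W_j^\ell h_i^{\ell-1} + b_j^\ell > 0$, else $I[h_{ji}^{\ell-1}] = 0$.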
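The kernel factor in this chain-rule computation can be sanity-checked numerically. The sketch below assumes Gaussian base kernels of the form exp(-||a - b||^2 / gamma_u), for which the derivative of the multi-kernel with respect to its first argument is -sum_u (2 beta_u / gamma_u) k_u(a, b) (a - b), and compares that expression with central finite differences. It covers only the derivative with respect to the hidden representation, not the further step through the rectifier to the weights; all names and values are hypothetical.

```python
import math


def multi_kernel(a, b, gammas, betas):
    """Weighted sum of Gaussian kernels exp(-||a - b||^2 / gamma_u)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return sum(beta * math.exp(-sq_dist / gamma)
               for gamma, beta in zip(gammas, betas))


def kernel_grad_wrt_first(a, b, gammas, betas):
    """Analytic gradient: -sum_u (2 beta_u / gamma_u) k_u(a, b) (a - b)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    scale = -sum(2.0 * beta / gamma * math.exp(-sq_dist / gamma)
                 for gamma, beta in zip(gammas, betas))
    return [scale * (x - y) for x, y in zip(a, b)]


def numeric_grad(a, b, gammas, betas, step=1e-6):
    """Central finite-difference gradient with respect to a."""
    grads = []
    for j in range(len(a)):
        up, down = list(a), list(a)
        up[j] += step
        down[j] -= step
        diff = (multi_kernel(up, b, gammas, betas)
                - multi_kernel(down, b, gammas, betas))
        grads.append(diff / (2.0 * step))
    return grads


# Hypothetical hidden representations, bandwidths, and kernel weights.
a = [0.3, -0.2, 0.5]
b = [0.1, 0.4, -0.1]
gammas = [0.5, 1.0, 2.0]
betas = [0.2, 0.3, 0.5]

analytic = kernel_grad_wrt_first(a, b, gammas, betas)
numeric = numeric_grad(a, b, gammas, betas)
max_gap = max(abs(x - y) for x, y in zip(analytic, numeric))
print(max_gap)
```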


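Learning $\beta$. The proposed multi-layer adaptation regularizer
performs layerwise matching by MK-MMD, hence we
seek to learn optimal kernel parameter $\beta$ for MK-MMD by
jointly maximizing the test power and minimizing the Type
II error (Gretton et al., 2012b), leading to the optimization

$$\max_{k \in \mathcal{K}} d_k^2(\mathcal{D}_s^\ell, \mathcal{D}_t^\ell) \, \sigma_k^{-2}, \quad (7)$$

where $\sigma_k^2 = \mathbf{E}_z g_k^2(z) - [\mathbf{E}_z g_k(z)]^2$ is estimation variance.
Letting $d = (d_1, d_2, \ldots, d_m)^{\top}$, each $d_u$ is MMD via kernel
$k_u$. Covariance $Q = \mathrm{cov}(g_k) \in \mathbb{R}^{m \times m}$ can be computed
in $O(m^2 n)$ cost, i.e. $Q_{uu'} = \frac{4}{n_s} \sum_{i=1}^{n_s/4} g_{k_u}^{\Delta}(\bar{z}_i) \, g_{k_{u'}}^{\Delta}(\bar{z}_i)$,
where $\bar{z}_i \triangleq (z_{2i-1}, z_{2i})$ and $g_{k_u}^{\Delta}(\bar{z}_i) \triangleq g_{k_u}(z_{2i-1}) -
g_{k_u}(z_{2i})$. Hence (7) reduces to a quadratic program (QP),

$$\min_{d^{\top} \beta = 1, \ \beta \geq 0} \beta^{\top} (Q + \varepsilon I) \beta, \quad (8)$$

where $\varepsilon = 10^{-3}$ is a small regularizer to make the problem
well-defined. By solving (8), we obtain a multi-kernel
$k = \sum_{u=1}^{m} \beta_u k_u$ that jointly maximizes the test power and
minimizes the Type II error.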
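For intuition, a small quadratic program of this shape can be solved with projected gradient descent. The sketch below is not the solver used by the authors; it swaps in a simple method under two stated assumptions: every per-kernel MMD estimate d_u is strictly positive, and Q is symmetric positive semidefinite. Substituting w_u = d_u * beta_u turns the linear constraint into a probability simplex, onto which the standard sort-based projection is applied. The function names, the default ridge value, and the toy inputs are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    ordered = sorted(v, reverse=True)
    cumulative, tau = 0.0, 0.0
    for j, value in enumerate(ordered, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / j
        if value - candidate > 0:
            tau = candidate  # last index passing the test fixes the threshold
    return [max(x - tau, 0.0) for x in v]


def solve_kernel_weights(Q, d, eps=1e-3, iterations=5000):
    """Minimize beta^T (Q + eps I) beta subject to d^T beta = 1, beta >= 0."""
    m = len(d)
    if any(value <= 0 for value in d):
        raise ValueError("this sketch assumes every d_u is positive")
    # Change of variables w_u = d_u * beta_u: the constraint becomes sum(w) = 1.
    A = [[(Q[u][v] + (eps if u == v else 0.0)) / (d[u] * d[v])
          for v in range(m)] for u in range(m)]
    # Step 1/L with L = 2 * trace(A), an upper bound on 2 * lambda_max(A).
    step = 1.0 / (2.0 * sum(A[u][u] for u in range(m)))
    w = [1.0 / m] * m
    for _ in range(iterations):
        grad = [2.0 * sum(A[u][v] * w[v] for v in range(m)) for u in range(m)]
        w = project_to_simplex([w[u] - step * grad[u] for u in range(m)])
    return [w[u] / d[u] for u in range(m)]


# Toy check: with Q = 0 and eps = 1 the problem is min ||beta||^2 subject to
# 2 * beta_1 + beta_2 = 1, whose solution is d / (d^T d) = (0.4, 0.2).
Q = [[0.0, 0.0], [0.0, 0.0]]
d = [2.0, 1.0]
beta = solve_kernel_weights(Q, d, eps=1.0)
print(beta)
```

In practice an off-the-shelf QP solver would be the more usual choice; the projected-gradient loop is used here only to keep the example dependency-free.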

We note that the DAN objective (4) is essentially a minimax
problem; i.e., we compute $\min_{\Theta} \max_{\mathcal{K}} d_k^2(\mathcal{D}_s^\ell, \mathcal{D}_t^\ell) \, \sigma_k^{-2}$. The
CNN parameter $\Theta$ is learned by minimizing MK-MMD as
a domain discrepancy, while the MK-MMD parameter $\beta$ is
learned by minimizing the Type II error. Both criteria are
dedicated to an effective adaptation of domain discrepancy,
aiming to consolidate the transferability of DAN features.
We accordingly adopt an alternating optimization that updates
$\Theta$ by mini-batch SGD (5) and $\beta$ by QP (8) iteratively.
Both updates cost $O(n)$ and are scalable to large datasets.


3.3. Analysis 

We provide an analysis of the expected target-domain risk
of our approach, making use of the theory of domain adaptation
(Ben-David et al., 2007; 2010; Mansour et al., 2009)
and the theory of kernel embedding of probability distributions
(Sriperumbudur et al., 2009; Gretton et al., 2012a;b).

Theorem 1 Let $\theta \in \mathcal{H}$ be a hypothesis, $\epsilon_s(\theta)$ and $\epsilon_t(\theta)$ be
the expected risks of source and target respectively, then

$$\epsilon_t(\theta) \leq \epsilon_s(\theta) + 2 d_k(p, q) + C, \quad (9)$$

where $C$ is a constant for the complexity of hypothesis
space and the risk of an ideal hypothesis for both domains.

Proof sketch: A result from Ben-David et al. (2007) shows
that $\epsilon_t(\theta) \leq \epsilon_s(\theta) + d_{\mathcal{H}}(p, q) + C_0$, where $d_{\mathcal{H}}(p, q)$ is the
$\mathcal{H}$-divergence between $p$ and $q$, which is defined as

$$d_{\mathcal{H}}(p, q) \triangleq 2 \sup_{\eta \in \mathcal{H}} \left| \Pr_{x^s \sim p}[\eta(x^s) = 1] - \Pr_{x^t \sim q}[\eta(x^t) = 1] \right|. \quad (10)$$

The $\mathcal{H}$-divergence relies on the capacity of the hypothesis
space $\mathcal{H}$ to distinguish distributions $p$ from $q$, and $\eta \in \mathcal{H}$
can be viewed as a two-sample classifier. By choosing $\eta$ as
a (kernel) Parzen window classifier (Sriperumbudur et al.,
2009), $d_{\mathcal{H}}(p, q)$ can be bounded by the empirical estimate

$$\begin{aligned}
d_{\mathcal{H}}(p, q) &\leq \hat{d}_{\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) + C_1 \\
&\leq 2 \left( 1 - \inf_{\eta \in \mathcal{H}} \left[ \sum_{i=1}^{n_s} \frac{L[\eta(x_i^s) = 1]}{n_s} + \sum_{j=1}^{n_t} \frac{L[\eta(x_j^t) = -1]}{n_t} \right] \right) + C_1 \\
&= 2 \left( 1 + d_k(p, q) \right) + C_1,
\end{aligned} \quad (11)$$

where $L(\cdot)$ is the linear loss function of the Parzen window
classifier $\eta$, $L[\eta = 1] \triangleq -\eta$, $L[\eta = -1] \triangleq \eta$. By explicitly
minimizing MK-MMD in multiple layers, the features and
classifier learned by the proposed DAN model can decrease
the upper bound on target risk. The source classifier and the
two-sample classifier together provide a way to assess the
adaptation performance, and can facilitate model selection.
Note that we maximize MK-MMD w.r.t. $\beta$ (7) to minimize
Type II test error, and to help the Parzen window classifier
achieve minimal risk of two-sample discrimination in (11).


4. Experiments 

We compare the DAN model to state-of-the-art transfer 
learning and deep learning methods on both unsupervised 
and semi-supervised adaptation problems, focusing on the 
efficacy of multi-layer adaptation with multi-kernel MMD. 

4.1. Setup

Office-31 (Saenko et al., 2010). This dataset is a standard
benchmark for domain adaptation. It consists of 4,652 images
within 31 categories collected from three distinct domains:
Amazon (A), which contains images downloaded
from amazon.com, and Webcam (W) and DSLR (D), which
are images taken by web camera and digital SLR camera in
an office with different environment variation, respectively.
We evaluate our method across the 3 transfer tasks, A → W,
D → W and W → D, which are commonly adopted in
deep learning methods (Donahue et al., 2014; Tzeng et al.,
2014). For completeness, we further include the evaluation
on the other 3 transfer tasks, A → D, D → A and W → A.

Office-10 + Caltech-10 (Gong et al., 2012). This dataset
consists of the 10 common categories shared by the Office-31
and Caltech-256 (C) (Griffin et al., 2007) datasets and
is widely adopted in transfer learning methods (Long et al.,
2013; Baktashmotlagh et al., 2013). We can build another
6 transfer tasks: A → C, W → C, D → C, C → A, C → W,
and C → D. With more transfer tasks, we are targeting an
unbiased look at the dataset bias (Torralba & Efros, 2011).

We compare to a variety of methods: TCA (Pan et al.,
2011), GFK (Gong et al., 2012), CNN (Krizhevsky et al.,
2012), LapCNN (Weston et al., 2008), and DDC
(Tzeng et al., 2014). Specifically, TCA is a conventional
transfer learning method based on MMD-regularized PCA.
GFK is a widely-adopted method for our datasets which
interpolates across intermediate subspaces to bridge the
source and target. CNN was the leading method in the
ImageNet 2012 competition, and it turns out to be a strong
model for learning transferable features (Yosinski et al.,
2014). LapCNN is a semi-supervised variant of CNN
based on Laplacian graph regularization. Finally, DDC is a
domain adaptation variant of CNN that adds an adaptation
layer between the fc7 and fc8 layers that is regularized
by single-kernel MMD. We implement the CNN-based
methods, i.e., CNN, LapCNN, DDC, and DAN based on
the Caffe (Jia et al., 2014) implementation of AlexNet
(Krizhevsky et al., 2012) trained on the ImageNet dataset.
In order to study the efficacy of multi-layer adaptation and
multi-kernel MMD, we evaluate several variants of DAN:
(1) DAN using only one hidden layer, either fc7 or fc8
for adaptation, termed DAN_7 and DAN_8 respectively; (2)
DAN using single-kernel MMD for adaptation, termed
DAN_SK.

We mainly follow standard evaluation protocol for unsupervised
adaptation and use all source examples with labels
and all target examples without labels (Gong et al., 2013).
To make our results directly comparable to most published
results, we also report a classical protocol (Saenko et al., 2010)
in which we randomly down-sample the source examples,
and further require 3 labeled target examples per category
for semi-supervised adaptation. We compare the averages
and standard errors of classification accuracy for each task.
For baseline methods, we follow the standard procedures
for model selection as explained in their respective papers.
For MMD-based methods (i.e., TCA, DDC, and DAN),
we use a Gaussian kernel $k(x_i, x_j) = e^{-\|x_i - x_j\|^2 / \gamma}$
with the bandwidth $\gamma$ set to the median pairwise distances
on the training data, i.e., the median heuristic (Gretton et al.,
2012b). We use multi-kernel MMD for DAN, and consider
a family of $m$ Gaussian kernels $\{k_u\}_{u=1}^{m}$ by varying
bandwidth $\gamma_u$ between $2^{-8}\gamma$ and $2^{8}\gamma$ with a multiplicative
step-size of $2^{1/2}$ (Gretton et al., 2012b). As minimizing
MMD is equivalent to maximizing the error of classifying
the source from the target (two-sample classifier)
(Sriperumbudur et al., 2009), we can automatically select
the MMD penalty parameter $\lambda$ on a validation set (comprised
of source-labeled instances and target-unlabeled
instances) by jointly assessing the test errors of the source
classifier and the two-sample classifier. We use the fine-tuning
architecture (Yosinski et al., 2014); however, due to
limited training examples in our datasets, we fix convolutional
layers conv1-conv3 that were copied from the pre-trained
model, fine-tune conv4-conv5 and fully connected
layers fc6-fc7, and train classifier layer fc8, both via back
propagation. As the classifier is trained from scratch, we
set its learning rate to be 10 times that of the lower layers.
We use stochastic gradient descent (SGD) with 0.9
momentum and the learning rate annealing strategy implemented
in Caffe, and cross-validate base learning rate between
$10^{-5}$ and $10^{-2}$ with a multiplicative step-size $10^{1/2}$.
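The kernel family used in this setup can be generated mechanically. The sketch below reads the bandwidth range as running from 2^-8 to 2^8 times a base bandwidth with a multiplicative step of 2^(1/2), and takes the base bandwidth to be the median pairwise Euclidean distance; both readings are assumptions (the median of squared distances is another common convention), and the function names and toy points are hypothetical.

```python
import math
from statistics import median


def median_pairwise_distance(points):
    """Median Euclidean distance over all unordered pairs of points."""
    distances = []
    for i, p in enumerate(points):
        for q in points[i + 1:]:
            distances.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))))
    return median(distances)


def bandwidth_family(gamma, low_exp=-8.0, high_exp=8.0, step_exp=0.5):
    """Bandwidths gamma * 2^e for e = low_exp, low_exp + step_exp, ..., high_exp."""
    count = int(round((high_exp - low_exp) / step_exp)) + 1
    return [gamma * 2.0 ** (low_exp + i * step_exp) for i in range(count)]


# Hypothetical training points; pairwise distances are 5, 10, and 5.
points = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
gamma = median_pairwise_distance(points)
gammas = bandwidth_family(gamma)
print(gamma, len(gammas), gammas[0], gammas[-1])
```

Under these assumptions the family contains 33 kernels, one per half-integer exponent from -8 to 8.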

4.2. Results and Discussion 

The unsupervised adaptation results on the first six Office-31
transfer tasks are shown in Table 1, and the results
on the other six Office-10 + Caltech-10 transfer tasks are
shown in Table 2. To directly compare with DDC, we report
semi-supervised adaptation results of the same tasks
used by DDC in Table 3. We can observe that DAN significantly
outperforms the comparison methods on most
transfer tasks, and achieves comparable performance on the
easy transfer tasks, D → W and W → D, where source and
target are similar (Saenko et al., 2010). This is reasonable
as the adaptability may vary across different transfer tasks.
The performance boost demonstrates that our architecture
of multi-layer adaptation via multi-kernel MMD is able to
transfer pre-trained deep models across different domains.

Table 1. Accuracy on Office-31 dataset with standard unsupervised adaptation protocol (Gong et al., 2013).

Method    A → W        D → W        W → D        A → D        D → A        W → A        Average
TCA       21.5 ± 0.0   50.1 ± 0.0   58.4 ± 0.0   11.4 ± 0.0    8.0 ± 0.0   14.6 ± 0.0   27.3
GFK       19.7 ± 0.0   49.7 ± 0.0   63.1 ± 0.0   10.6 ± 0.0    7.9 ± 0.0   15.8 ± 0.0   27.8
CNN       61.6 ± 0.5   95.4 ± 0.3   99.0 ± 0.2   63.8 ± 0.5   51.1 ± 0.6   49.8 ± 0.4   70.1
LapCNN    60.4 ± 0.3   94.7 ± 0.5   99.1 ± 0.2   63.1 ± 0.6   51.6 ± 0.4   48.2 ± 0.5   69.5
DDC       61.8 ± 0.4   95.0 ± 0.5   98.5 ± 0.4   64.4 ± 0.3   52.1 ± 0.8   52.2 ± 0.4   70.6
DAN_7     63.2 ± 0.2   94.8 ± 0.4   98.9 ± 0.3   65.2 ± 0.4   52.3 ± 0.4   52.1 ± 0.4   71.1
DAN_8     63.8 ± 0.4   94.6 ± 0.5   98.8 ± 0.6   65.8 ± 0.4   52.8 ± 0.4   51.9 ± 0.5   71.3
DAN_SK    63.3 ± 0.3   95.6 ± 0.2   99.0 ± 0.4   65.9 ± 0.7   53.2 ± 0.5   52.1 ± 0.4   71.5
DAN       68.5 ± 0.4   96.0 ± 0.3   99.0 ± 0.2   67.0 ± 0.4   54.0 ± 0.4   53.1 ± 0.3   72.9

Table 2. Accuracy on Office-10 + Caltech-10 dataset with standard unsupervised adaptation protocol (Gong et al., 2013).

Method    A → C        W → C        D → C        C → A        C → W        C → D        Average
TCA       42.7 ± 0.0   34.1 ± 0.0   35.4 ± 0.0   54.7 ± 0.0   50.5 ± 0.0   50.3 ± 0.0   44.6
GFK       41.4 ± 0.0   26.4 ± 0.0   36.4 ± 0.0   56.2 ± 0.0   43.7 ± 0.0   42.0 ± 0.0   41.0
CNN       83.8 ± 0.3   76.1 ± 0.5   80.8 ± 0.4   91.1 ± 0.2   83.1 ± 0.3   89.0 ± 0.3   84.0
LapCNN    83.6 ± 0.6   77.8 ± 0.5   80.6 ± 0.4   92.1 ± 0.3   81.6 ± 0.4   87.8 ± 0.4   83.9
DDC       84.3 ± 0.5   76.9 ± 0.4   80.5 ± 0.2   91.3 ± 0.3   85.5 ± 0.3   89.1 ± 0.3   84.6
DAN_7     84.7 ± 0.3   78.2 ± 0.5   81.8 ± 0.3   91.6 ± 0.4   87.4 ± 0.3   88.9 ± 0.5   85.4
DAN_8     84.4 ± 0.3   80.8 ± 0.4   81.7 ± 0.2   91.7 ± 0.3   90.5 ± 0.4   89.1 ± 0.4   86.4
DAN_SK    84.1 ± 0.4   79.9 ± 0.4   81.1 ± 0.5   91.4 ± 0.3   86.9 ± 0.5   89.5 ± 0.3   85.5
DAN       86.0 ± 0.5   81.5 ± 0.3   82.0 ± 0.4   92.0 ± 0.3   92.0 ± 0.4   90.5 ± 0.2   87.3

Table 3. Accuracy on Office-31 dataset with classic unsupervised
and semi-supervised adaptation protocols (Saenko et al., 2010).

Protocol          Method   A → W        D → W        W → D        Average
Unsupervised      DDC      59.4 ± 0.8   92.5 ± 0.3   91.7 ± 0.8   81.2
Unsupervised      DAN      66.0 ± 0.4   93.5 ± 0.2   95.3 ± 0.3   84.9
Semi-supervised   DDC      84.1 ± 0.6   95.4 ± 0.4   96.3 ± 0.3   91.9
Semi-supervised   DAN      85.7 ± 0.3   97.2 ± 0.2   96.4 ± 0.2   93.1

From the experimental results, we can make the following
observations. (1) Deep learning based methods
outperform conventional shallow transfer learning
methods by a large margin. (2) Among the deep learning
methods, the semi-supervised LapCNN provides no
improvement over CNN, suggesting that the challenge of
domain discrepancy cannot be readily bridged by semi-supervised
learning. (3) DDC, a cross-domain variant of
CNN with single-layer adaptation via single-kernel MMD, 
generally outperforms CNN, confirming its effectiveness in 
learning transferable features using domain-adaptive deep 
models. Note that while DDC based on Caffe AlexNet was 
shown to significantly outperform DeCAF (Donahue et al., 
2014) in which fine-tuning was not carried out, it does not 
yield a large gain over Caffe AlexNet using fine-tuning. 
This shows the limitation of single-layer adaptation via 
single-kernel MMD, which cannot explore the strengths of 
deep networks and multiple kernels for domain adaptation. 

To dive deeper into DAN, we present the results of three
variants of DAN. (1) DAN_7 and DAN_8 achieve better accuracy
than DDC, which highlights that multi-kernel MMD can bridge
the domain discrepancy more effectively than single-kernel
MMD. The reason is that multiple kernels with different
bandwidths can match both the low-order and high-order moments
to minimize the Type II error (Gretton et al., 2012b). (2)
DAN_SK also attains higher accuracy than DDC, which confirms
the capability of deep architecture for distribution
adaptation. The rationale is similar to that of deep networks:
each layer of a deep network is intended to extract features
at a different abstraction level, and hence we need to match
the distributions at each task-specific layer to consolidate
the adaptation quality at all levels. The multi-layer
architecture is one of the most critical contributors to the
efficacy of deep learning, and we believe it is also important
for MMD-based adaptation. The comparable performance of the
multi-layer variant DAN_SK and the multi-kernel variants DAN_7
and DAN_8 shows their equal importance for domain adaptation.
As expected, DAN obtains the best performance by jointly
exploring multi-layer adaptation with multi-kernel MMD.
Another benefit of DAN is that it uses a linear-time unbiased
estimate of the kernel embedding, which makes it an order more
efficient than the existing methods TCA and DDC. Though Tzeng
et al. (2014) speed up DDC by computing the MMD within each
mini-batch of the SGD, this leads to a biased estimate of MMD
and lower adaptation accuracy.
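
To make the linear-time, multi-kernel estimate concrete, the following is a minimal sketch in plain Python. It rests on several assumptions that are not stated in this section: a Gaussian kernel of the form exp(-||x - y||^2 / bandwidth), uniform kernel weights (whereas DAN selects the weights by optimization), and a commonly used linear-time form that averages a kernel statistic over disjoint quad-tuples of two source and two target points. All function names are hypothetical.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel exp(-||x - y||^2 / bandwidth) (assumed form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / bandwidth)


def multi_kernel(x, y, bandwidths, weights):
    """Weighted sum of Gaussian kernels with different bandwidths."""
    return sum(w * gaussian_kernel(x, y, g) for w, g in zip(weights, bandwidths))


def linear_time_mmd2(source, target, bandwidths, weights=None):
    """Linear-time estimate of squared MMD over disjoint quad-tuples.

    Each quad-tuple uses two source and two target points, so the
    cost grows linearly with the number of samples.
    """
    if weights is None:
        # Assumption: uniform weights stand in for the learned ones.
        weights = [1.0 / len(bandwidths)] * len(bandwidths)
    n = min(len(source), len(target))
    n -= n % 2  # keep an even number of points per domain
    if n < 2:
        raise ValueError("need at least two points per domain")

    def k(a, b):
        return multi_kernel(a, b, bandwidths, weights)

    total = 0.0
    for i in range(0, n, 2):
        xs1, xs2 = source[i], source[i + 1]
        xt1, xt2 = target[i], target[i + 1]
        total += k(xs1, xs2) + k(xt1, xt2) - k(xs1, xt2) - k(xs2, xt1)
    return 2.0 * total / n  # average over the n / 2 quad-tuples


random.seed(0)
src = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
tgt = [[random.gauss(3, 1), random.gauss(3, 1)] for _ in range(200)]
print("identical samples:", linear_time_mmd2(src, src, [1.0, 4.0, 16.0]))
print("shifted samples:  ", linear_time_mmd2(src, tgt, [1.0, 4.0, 16.0]))
```

Because each sample enters exactly one quad-tuple, the statistic needs no pairwise kernel matrix. In a multi-layer setting, the same function would be applied to the activations of each adapted layer.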

4.3. Empirical Analysis 

Feature Visualization To demonstrate the transferability of
the DAN learned features, we follow Donahue et al. (2014) and
Tzeng et al. (2014) and plot in Figures 2(a)-2(b) and
2(c)-2(d) the t-SNE embeddings of the images in task C → W
with DDC features and DAN features, respectively. We make the
following observations: (1) With DDC features, the target
points are not discriminated very well, while with DAN
features, the points are discriminated much better. (2) With
DDC features, the categories between the source and the target
are not aligned very well, while with DAN features, the
categories are aligned much better between domains. Both
observations can explain the superior performance of DAN over
DDC: (1) implies that the target points are more easily
discriminated with DAN features, and (2) implies that the
target points can be better discriminated with the source
classifier. DAN can thus learn more transferable features for
effective domain adaptation.

Figure 2. Feature visualization: t-SNE of DDC features on source (a) and target (b); t-SNE of DAN features on source (c) and target (d).

Figure 3. Empirical analysis: (a) $\mathcal{A}$-distance of CNN & DAN features;
(b) sensitivity of $\lambda$ (dashed lines show best baseline results).

$\mathcal{A}$-Distance A theoretical result in Ben-David et al. (2010)
suggests the $\mathcal{A}$-distance as a measure of domain discrepancy.
As computing the exact $\mathcal{A}$-distance is intractable, an
approximate distance is defined as
$\hat{d}_{\mathcal{A}} = 2(1 - 2\epsilon)$, where $\epsilon$ is the
generalization error of a two-sample classifier (kernel SVM in
our case) trained on the binary problem of distinguishing
input samples between the source and target domains. Figure
3(a) displays $\hat{d}_{\mathcal{A}}$ on transfer tasks A → W and
C → W using Raw features, CNN features, and DAN features,
respectively. It reveals a surprising observation:
$\hat{d}_{\mathcal{A}}$ on both CNN and DAN features is larger than
$\hat{d}_{\mathcal{A}}$ on Raw features. This implies that abstract
deep features can be salient both for discriminating different
categories and different domains, which is consistent with
Glorot et al. (2011). However, domain adaptation may be
deteriorated by the enlarged domain discrepancy (Ben-David et
al., 2010). Reassuringly, $\hat{d}_{\mathcal{A}}$ on DAN features is
smaller than $\hat{d}_{\mathcal{A}}$ on CNN features, which guarantees
more transferable features.

Parameter Sensitivity We investigate the effects of the
parameter $\lambda$. Figure 3(b) illustrates the variation of
transfer classification performance as
$\lambda \in \{0.1, 0.4, 0.7, 1, 1.4, 1.7, 2\}$ on tasks A → W and
C → W. The DAN accuracy first increases and then decreases as
$\lambda$ varies, tracing a bell-shaped curve. This confirms
the motivation of jointly learning deep features and adapting
distribution discrepancy, since a good trade-off between them
can enhance feature transferability.
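
The trade-off governed by $\lambda$ can be sketched as follows. The sketch assumes the training objective is the classification loss plus $\lambda$ times the sum of the per-layer MMD penalties, which is how the method is described at a high level. The function names, the `evaluate` callback, and the toy scoring function are hypothetical and carry no experimental meaning.

```python
LAMBDA_GRID = [0.1, 0.4, 0.7, 1.0, 1.4, 1.7, 2.0]


def dan_objective(classification_loss, layer_mmds, lam):
    """Classification loss plus lam times the summed per-layer MMD penalties."""
    return classification_loss + lam * sum(layer_mmds)


def select_lambda(evaluate, grid=LAMBDA_GRID):
    """Score every candidate with `evaluate` and return the best one.

    `evaluate` is a user-supplied callable mapping lam to a validation
    score (higher is better), for example accuracy after training.
    """
    scores = {lam: evaluate(lam) for lam in grid}
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-in for training and validation: a synthetic score
# that peaks at lam = 1.0 (not a measured result).
best, scores = select_lambda(lambda lam: -(lam - 1.0) ** 2)
print("objective example:", dan_objective(1.0, [0.2, 0.3], 0.5))
print("selected lambda:", best)
```

A very small $\lambda$ leaves the domain discrepancy largely unpenalized, while a very large one lets the penalty dominate the classification loss. This is consistent with a curve that rises and then falls.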

5. Conclusion 

In this paper, we have proposed a novel Deep Adaptation
Network (DAN) architecture to enhance the transferability
of features from task-specific layers of the neural network.
We confirm that while general features can generalize well
to a novel task, specific features tailored to an original task
cannot bridge the domain discrepancy effectively. We show
that feature transferability can be enhanced substantially by
mean-embedding matching of the multi-layer representations
across domains in a reproducing kernel Hilbert space.
An optimal multi-kernel selection strategy further improves
the embedding matching effectiveness, while an unbiased
estimate of the mean embedding naturally leads to a
linear-time algorithm that is very desirable for deep learning
from large-scale datasets. An extensive empirical evaluation on
standard domain adaptation benchmarks demonstrates the
efficacy of the proposed model against previous methods.

As deep features transition from general to specific along
the network, it would be interesting to study a principled way
of deciding the boundary between generality and specificity,
and to apply distribution adaptation to the convolutional
layers of the CNN to further enhance feature transferability.